logo

Introduction

The purpose of this project is to gauge your technical skills and problem solving ability by working through something similar to a real NBA data science project. You will work your way through this R Markdown document, answering questions as you go along. Please begin by adding your name to the “author” key in the YAML header. When you’re finished with the document, come back and type your answers into the answer key at the top. Please leave all your work below and have your answers where indicated below as well. Please note that we will be reviewing your code so make it clear, concise, and avoid long printouts. Feel free to add in as many new code chunks as you’d like.

Remember that we will be grading the quality of your code and visuals alongside the correctness of your answers. Please try to use the tidyverse as much as possible (instead of base R and explicit loops). Please do not bring in any outside data, and use the provided data as truth (for example, some “home” games have been played at secondary locations, including TOR’s entire 2020-21 season. These are not reflected in the data and you do not need to account for this.) Note that the OKC and DEN 2024-25 schedules in schedule_24_partial.csv intentionally include only 80 games, as the league holds 2 games out for each team in the middle of December due to unknown NBA Cup matchups. Do not assign specific games to fill those two slots.

Note:

Throughout this document, any season column represents the year each season started. For example, the 2015-16 season will be in the dataset as 2015. We may refer to a season by just this number (e.g. 2015) instead of the full text (e.g. 2015-16).

Answers

Part 1

Question 1: 26 4-in-6 stretches in OKC’s draft schedule.

Question 2: 24.6 4-in-6 stretches on average.

Question 3:

  • Most 4-in-6 stretches on average: CHA (27.8)
  • Fewest 4-in-6 stretches on average: NYK (21.8)

Question 4: This is a written question. Please leave your response in the document under Question 4.

Question 5:

  • BKN Defensive eFG%: 54.3%
  • When opponent on a B2B: 53.5%

Part 2

Please show your work in the document, you don’t need anything here.

Part 3

Question 9:

  • Most Helped by Schedule: MIL (+44.7 wins)
  • Most Hurt by Schedule: DET (-57.9 wins)

Setup and Data

library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(tidyverse)
# Note, you will likely have to change these paths. If your data is in the same folder as this project, 
# the paths will likely be fixed for you by deleting ../../Data/schedule_project/ from each string.
schedule <- read_csv("schedule.csv")
draft_schedule <- read_csv("schedule_24_partial.csv")
locations <- read_csv("locations.csv")
game_data <- read_csv("team_game_data.csv")

Part 1 – Schedule Analysis

In this section, you’re going to work to answer questions using NBA scheduling data.

Question 1

QUESTION: How many times are the Thunder scheduled to play 4 games in 6 nights in the provided 80-game draft of the 2024-25 season schedule? (Note: clarification, the stretches can overlap, the question is really “How many games are the 4th game played over the past 6 nights?”)

Note by me: This code counts overlapping stretches rather than distinct stretches.

# How many games are the 4th game played over the past 6 nights?
draft_schedule_df <- as.data.frame(draft_schedule) # Convert to data frame
okc_2024 <- draft_schedule_df[draft_schedule_df$team == "OKC",] # OKC schedule data frame

okc_2024 <- okc_2024 %>%
  arrange(gamedate) # arrange date 

count <- 0 # count for how many times Thunder will play 4 games in 6 nights

for (i in seq(1, nrow(okc_2024)-3)) { # stops at nrow - 3 so the loop doesn't go out of bounds
  first_game <- okc_2024$gamedate[i] 
  fourth_game <- okc_2024$gamedate[i+3]
  
  days_since_previousgame <- fourth_game - first_game # Stores difference in days
  
  if (days_since_previousgame == 5) {
    count <- count + 1  # Increases every time there is a 4 in 6 stretch
  }
  
}

paste("There are", count, "times this season that the Thunder will play 4 games in 6 nights. ")
## [1] "There are 26 times this season that the Thunder will play 4 games in 6 nights. "

ANSWER 1:

26 4-in-6 stretches in OKC’s draft schedule.

Question 2

QUESTION: From 2014-15 to 2023-24, what is the average number of 4-in-6 stretches for a team in a season? Adjust each team/season to per-82 games before taking your final average.

schedule_df <- as.data.frame(schedule)
schedule_df <- schedule_df %>% arrange(team, season, gamedate)

teams <- unique(schedule_df$team)

team_season_count <- data.frame(team = character(), season = numeric(), number_of_4in6 = numeric(), stringsAsFactors = FALSE)

# Loop through each team
for (team_now in teams) {

  specific_team <- schedule_df %>%
    filter(team == team_now)

  seasons <- unique(specific_team$season)

  # Loop through each season, for each team
  for (season_now in seasons) {

    # team's schedule for that specific season
    specific_season <- specific_team %>%
      filter(season == season_now) %>%
      arrange(gamedate) 

    count <- 0 # counts how many 4-in-6 stretches

    if (nrow(specific_season) >= 4) {
      for (i in seq(1, nrow(specific_season) -3)) { # 4-in-6 loop (I used the same one as Question 1)
        first_game <- specific_season$gamedate[i]
        fourth_game <- specific_season$gamedate[i+3]
        days_since_previousgame <- fourth_game - first_game

        if (days_since_previousgame == 5) {
          count <- count + 1
        }
      }
    }

    # Main dataframe
    team_season_count <- rbind(team_season_count, data.frame(team = team_now, season = season_now, number_of_4in6 = count))
    
  }
}

# Create games_played list so that it can be added as a new column to the main dataframe
games_played <- schedule_df %>%
  group_by(team, season) %>%
  summarize(games_played = n())
## `summarise()` has grouped output by 'team'. You can override using the
## `.groups` argument.
team_season_count <- cbind(team_season_count, games_played$games_played) # combine games played column to compute per 82
team_season_count <- team_season_count %>% 
  mutate(per_82_4in6 = (number_of_4in6 / games_played$games_played) * 82) # per82 takes into account the shortened 2019 & 2020 seasons

new_team_season_count <- team_season_count %>%    # new dataframe to showcase how many 4-in-6 stretches per season from 2014-2023
  group_by(team) %>%
  summarize(per_season_4in6 = mean(per_82_4in6)) # variable per_season_4in6 tells how many 4-in-6 stretches per season for that specific team

mean(new_team_season_count$per_season_4in6) # gives average of how many 4-in-6 stretches a team plays per season
## [1] 24.63766

ANSWER 2:

24.6 4-in-6 stretches on average.

Question 3

QUESTION: Which of the 30 NBA teams has had the highest average number of 4-in-6 stretches between 2014-15 and 2023-24? Which team has had the lowest average? Adjust each team/season to per-82 games.

Note: I used the same code as question 2, except for the final few lines since it had a similar premise but a different query.

schedule_df <- as.data.frame(schedule)
schedule_df <- schedule_df %>% arrange(team, season, gamedate)

teams <- unique(schedule_df$team)

team_season_count <- data.frame(team = character(), season = numeric(), number_of_4in6 = numeric(), stringsAsFactors = FALSE)

# Loop through each team
for (team_now in teams) {

  specific_team <- schedule_df %>%
    filter(team == team_now)

  seasons <- unique(specific_team$season)

  # Loop through each season, for each team
  for (season_now in seasons) {

    # team's schedule for that specific season
    specific_season <- specific_team %>%
      filter(season == season_now) %>%
      arrange(gamedate) 

    count <- 0 # counts how many 4-in-6 stretches

    if (nrow(specific_season) >= 4) {
      for (i in seq(1, nrow(specific_season) -3)) { # 4-in-6 loop (I used the same one as Question 1)
        first_game <- specific_season$gamedate[i]
        fourth_game <- specific_season$gamedate[i+3]
        days_since_previousgame <- fourth_game - first_game

        if (days_since_previousgame == 5) {
          count <- count + 1
        }
      }
    }

    # Main dataframe
    team_season_count <- rbind(team_season_count, data.frame(team = team_now, season = season_now, number_of_4in6 = count))
    
  }
}

# Create games_played list so that it can be added as a new column to the main dataframe
games_played <- schedule_df %>%
  group_by(team, season) %>%
  summarize(games_played = n())
## `summarise()` has grouped output by 'team'. You can override using the
## `.groups` argument.
team_season_count <- cbind(team_season_count, games_played$games_played) # combine games played column to compute per 82
team_season_count <- team_season_count %>% 
  mutate(per_82_4in6 = (number_of_4in6 / games_played$games_played) * 82) # per82 takes into account the shortened 2019 & 2020 seasons

new_team_season_count <- team_season_count %>%    # new dataframe to showcase how many 4-in-6 stretches per season from 2014-2023
  group_by(team) %>%
  summarize(per_season_4in6 = mean(per_82_4in6)) # variable per_season_4in6 tells how many 4-in-6 stretches per season for that specific team

new_team_season_count <- as.data.frame(new_team_season_count)
new_team_season_count[which.max(new_team_season_count$per_season_4in6), ] # extract row of team that has the most 4-in-6 stretches per season
##   team per_season_4in6
## 4  CHA        27.80919
new_team_season_count[which.min(new_team_season_count$per_season_4in6), ] # extract row of team that has the least 4-in-6 stretches per season
##    team per_season_4in6
## 20  NYK        21.78611

ANSWER 3:

  • Most 4-in-6 stretches on average: CHA (27.8)

  • Fewest 4-in-6 stretches on average: NYK (21.8)

Question 4

QUESTION: Is the difference between most and least from Q3 surprising, or do you expect that size difference is likely to be the result of chance?

ANSWER 4:

The difference between CHA and NYK’s 4-in-6 stretches per season seemed more of a result of chance. The mean and median of the 4-in-6 per season column are 24.64 and 24.82, respectively, indicating a marginally (negatively) skewed distribution, but nearly a normal and symmetric distribution given how close they are.

Furthermore, the range from the minimum value to both the median and mean is very close to the range from the maximum to the median and mean, which aligns with its nearly normal distribution. This makes the difference between the minimum and maximum seem more of a result of chance rather than a surprise.

Question 5

QUESTION: What was BKN’s defensive eFG% in the 2023-24 season? What was their defensive eFG% that season in situations where their opponent was on the second night of back-to-back?

# Defensive eFG% = (field goals made + 0.5 * 3-pointers made) / field goal attempts

game_data_df <- as.data.frame(game_data) # Convert to data frame
bkn <- game_data_df[game_data_df$def_team == "BKN", ] # Brooklyn defense only
bkn_2023 <- bkn[bkn$season == 2023, ]

fgm <- sum(bkn_2023$fgmade) # Total field goals allowed by Brooklyn in 2023-24
fga <- sum(bkn_2023$fgattempted) # Total field goals allowed by Brooklyn in 2023-24
fgm3 <- sum(bkn_2023$fg3made) # Total 3 pointers allowed by Brooklyn in 2023-24

bkn_def_efg <- (fgm + (0.5 * fgm3)) / (fga)
paste("Brooklyn Defensive eFG%: ", bkn_def_efg)
## [1] "Brooklyn Defensive eFG%:  0.543487250172295"
# Defensive effective field goal percentage when opponent was on second night of back-to-back
df_2023 <- game_data_df[game_data_df$season == 2023, ] # new dataframe for filtering: 2023-24 games only

df_2023 <- df_2023%>%arrange(off_team, gamedate) # arrange by team, in order of date

df_2023 <- df_2023 %>%
  group_by(off_team) %>%
  mutate(days_since_previousgame = gamedate - lag(gamedate, default = first(gamedate))) %>% # new variable for days since previous game for off_team
  mutate(off_team_b2b = ifelse(days_since_previousgame == 1, TRUE, FALSE)) # new boolean that determines if game is back-to-back 

bkn_2023 <- df_2023[df_2023$def_team == "BKN", ] # change dataframe to now have new variables and also grouped
bkn_opp_b2b <- bkn_2023[bkn_2023$off_team_b2b == TRUE, ]

b2b_fgm <- sum(bkn_opp_b2b$fgmade) # Total field goals allowed
b2b_fga <- sum(bkn_opp_b2b$fgattempted) # Total field goal attempts
b2b_fgm3 <- sum(bkn_opp_b2b$fg3made) # Total 3-pointers allowed

bkn_def_efg_b2b <- (b2b_fgm + (0.5 * b2b_fgm3)) / (b2b_fga)
paste("Brooklyn Defensive eFG% when opponent on a back-to-back: ", bkn_def_efg_b2b)
## [1] "Brooklyn Defensive eFG% when opponent on a back-to-back:  0.53490832157969"

ANSWER 5:

  • BKN Defensive eFG%: 54.3%
  • When opponent on a B2B: 53.5%